(Day 5) Logistic Regression

2025 iThome Ironman Contest

DAY 5

AI & Data

Part 5 of the series "Common Machine Learning Algorithms: A 30-Day Introduction"

17th Ironman Contest

Alan Hsieh

2025-08-05 00:01:55


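Logistic Regression is a common classification model, used mainly for binary or multi-class prediction. Unlike the linear regression covered earlier, which predicts unbounded continuous values, logistic regression predicts discrete class labels drawn from a finite set, such as {0, 1} or {1, 2, 3}.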

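Model Overview

Model Logic and Core Concepts

So how does logistic regression work? Whichever variant you use, it starts from a linear regression prediction underneath; the variants differ only in the activation function and loss function applied on top. In practice logistic regression is most often used for binary classification, so let's walk through that flow first: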

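Start from a linear regression equation: $\hat{y} = \beta_0 + \mathbf{x}^\top \boldsymbol{\beta}$. (Note: at this point it is not yet the best-fitting line.)
Pass the output of that linear equation through the sigmoid function, which maps it into the range (0, 1) so that it can be read as a probability.
Use Binary Cross Entropy as the loss function (cost function).
Finally, use gradient descent (Batch Gradient Descent) to minimize the loss function and find the best logistic regression line.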
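The four steps above can be sketched in plain NumPy. This is a minimal illustration rather than the article's own implementation: the toy dataset, the learning rate, and the helper names `sigmoid` and `binary_cross_entropy` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Mean binary cross entropy; eps guards against log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Toy data (hypothetical): two features, labels generated by a known linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2.0 * X[:, 1] > 0).astype(float)

beta0, beta = 0.0, np.zeros(2)  # start from a line that is not yet optimal
lr = 0.1                        # assumed learning rate
losses = []
for _ in range(500):
    z = beta0 + X @ beta                       # step 1: linear score
    p = sigmoid(z)                             # step 2: squash into (0, 1)
    losses.append(binary_cross_entropy(y, p))  # step 3: measure the loss
    grad_z = (p - y) / len(y)                  # gradient of the mean loss w.r.t. z
    beta -= lr * (X.T @ grad_z)                # step 4: full-batch gradient update
    beta0 -= lr * grad_z.sum()

accuracy = np.mean((sigmoid(beta0 + X @ beta) > 0.5) == (y == 1))
print(f"first loss={losses[0]:.4f}, last loss={losses[-1]:.4f}, accuracy={accuracy:.2f}")
```

Every update uses all 200 samples at once, which is what makes this batch gradient descent rather than a stochastic or mini-batch variant.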

That is how binary logistic regression works. Now let's see how multi-class logistic regression handles the problem:

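Start from a linear regression equation: $\hat{y} = \beta_0 + \mathbf{x}^\top \boldsymbol{\beta}$, with one such linear score computed per class. (Note: at this point it is not yet the best-fitting line.)
Pass the outputs of those linear equations through the softmax function, which converts them into a set of probabilities that sum to 1.
Use Categorical Cross Entropy as the loss function (cost function).
Finally, use gradient descent (Batch Gradient Descent) to minimize the loss function and find the best logistic regression line.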
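To make the softmax and Categorical Cross Entropy steps concrete, here is a small NumPy sketch. The score matrix and one-hot labels are made-up values, and the function names are chosen only for this illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    log_probs = np.log(np.clip(probs, eps, 1.0))
    return -np.mean(np.sum(y_onehot * log_probs, axis=1))

# Hypothetical linear scores for 2 samples and 3 classes (one score per class).
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1]])

probs = softmax(scores)
loss = categorical_cross_entropy(y_onehot, probs)

print(probs)              # each row is a probability distribution over the classes
print(probs.sum(axis=1))  # each row sums to 1
print(loss)
```

The second sample has identical scores for all three classes, so softmax assigns each class the same probability, which shows that only the differences between scores matter.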

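As you can see, the variants of logistic regression differ only in their activation function and loss function. Although logistic regression can handle multi-class problems, it is still used most often for binary classification.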

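Evaluation Metrics

Accuracy: overall proportion of correct predictions
Precision / Recall / F1-score: how trustworthy the positive predictions are and how many actual positives are recovered
ROC-AUC: classification ability of the model across different thresholds
Confusion Matrix: distribution of TP, TN, FP, and FN
Log Loss: gap between the predicted probabilities and the actual labels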
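The code later in this article reports only accuracy and the classification report. As a rough sketch of how the remaining metrics could be computed with sklearn, the example below fits a model on the same kind of synthetic data; the variable names and the dataset settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression().fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1
y_pred = clf.predict(X_test)              # hard labels at the default 0.5 threshold

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
auc = roc_auc_score(y_test, y_prob)  # uses probabilities, not hard labels
ll = log_loss(y_test, y_prob)        # penalizes confident wrong probabilities

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"ROC-AUC: {auc:.3f}")
print(f"Log Loss: {ll:.3f}")
```

ROC-AUC and Log Loss are computed from `predict_proba` output, while the confusion matrix needs hard labels, so the two kinds of prediction are kept in separate variables.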

When to Use It

The target is binary (0/1, yes/no) or multi-class
You need both probability estimates and interpretability

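Limitations

Multicollinearity: highly correlated features make the coefficients unstable
Sensitivity to extreme values: outliers can significantly distort the model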

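Implementation

Starting with this example, the model is built with both sklearn and PyTorch so that readers get a better feel for what happens during training. One caveat up front: a hand-written or PyTorch-based version will not necessarily match the algorithm that sklearn provides, so unless you have a specific purpose, sklearn's implementation will usually be both faster and more accurate.

Because the goal of this article's implementation is to expose how Logistic Regression works, the explanation focuses on the PyTorch version.

Sklearn Implementation

The example below shows that sklearn is simple and fast to use, but it is hard to see the underlying mechanics directly. That is sklearn's drawback: the convenience of heavy encapsulation also means many details cannot be controlled.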

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Evaluate on the training set, then on the test set
print(f"Train Accuracy: {accuracy_score(y_train, y_train_pred):.2f}")
print(classification_report(y_train, y_train_pred))

print(f"Test Accuracy: {accuracy_score(y_test, y_test_pred):.2f}")
print(classification_report(y_test, y_test_pred))

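PyTorch Implementation

Doing the same thing in PyTorch is clearly more involved than in sklearn, but the code also lets us look into the details of Logistic Regression.

As mentioned earlier, Logistic Regression first produces a result with the Linear Regression equation and then transforms it with the Sigmoid function. The code below uses a single-layer neural network to reproduce the Logistic Regression algorithm, so it defines only an output layer: a linear layer followed by a Sigmoid function.

# Model definition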

class LogisticRegressionModel(nn.Module):
    def __init__(self, input_dim):
        super(LogisticRegressionModel, self).__init__()
        self.linear = nn.Linear(input_dim, 1)  # one output: the linear score before the sigmoid
        self.sigmoid = nn.Sigmoid()

    def forward(self, X):
        return self.sigmoid(self.linear(X))

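With the basic model structure defined, we set the Criterion to Binary Cross Entropy and the Optimizer to Batch Gradient Descent so that the whole setup matches the core concepts introduced earlier. PyTorch has no dedicated batch gradient descent optimizer; optim.SGD behaves as batch gradient descent here because every update uses the full training set.

# Set up the criterion and optimizer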

criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
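To see what nn.BCELoss computes, the sketch below compares it with the Binary Cross Entropy formula written out by hand. The probabilities and labels are made-up values. It also shows nn.BCEWithLogitsLoss, which takes the raw linear output and applies the sigmoid internally; it is mentioned only as a related option, and the article's own code keeps the explicit Sigmoid layer.

```python
import torch
import torch.nn as nn

# Hypothetical sigmoid outputs and their true labels
p = torch.tensor([[0.9], [0.2], [0.6]])
y = torch.tensor([[1.0], [0.0], [1.0]])

# Built-in loss, averaged over the samples by default
builtin = nn.BCELoss()(p, y)

# The same quantity written out: -mean(y*log(p) + (1-y)*log(1-p))
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

# Recover the pre-sigmoid scores (logits) and feed them to the logits variant
logits = torch.log(p / (1 - p))
with_logits = nn.BCEWithLogitsLoss()(logits, y)

print(builtin.item(), manual.item(), with_logits.item())
```

All three values agree up to floating-point error, which confirms that the criterion is the same Binary Cross Entropy described in the model overview.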

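Reading the training code together with the model structure above, you can see that the steps mirror the model overview:

A Linear Regression first makes a prediction and outputs a result
That result is then passed through sigmoid
Binary Cross Entropy is used to measure the loss
Finally, gradient descent (Batch Gradient Descent) minimizes the loss function and finds the best logistic regression line

# Training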

n_epochs = 1000
for epoch in range(n_epochs):
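    model.train()  # put the model in training mode
    optimizer.zero_grad()  # reset gradients; they accumulate by default, so clear the previous round's values before each epoch or the parameter updates will be wrong
    y_pred = model(X_train_tensor)  # forward pass
    loss = criterion(y_pred, y_train_tensor)  # loss between the predictions and the true labels, using the criterion defined above
    loss.backward()  # backward pass; the gradients are stored in each parameter's .grad attribute
    optimizer.step()  # update the model parameters from the gradients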
    
    if (epoch+1) % 100 == 0:
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {loss.item():.4f}")

Full Code

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Generate and preprocess the data
features_size = 5
X, y = make_classification(n_samples=100000, n_features=features_size, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)

# Convert to tensors

X_train_tensor = torch.tensor(X_train_scaler, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.reshape(-1, 1), dtype=torch.float32)

# Model definition

class LogisticRegressionModel(nn.Module):
    def __init__(self, input_dim):
        super(LogisticRegressionModel, self).__init__()
        self.linear = nn.Linear(input_dim, 1)  # one output: the linear score before the sigmoid
        self.sigmoid = nn.Sigmoid()

    def forward(self, X):
        return self.sigmoid(self.linear(X))

model = LogisticRegressionModel(input_dim=features_size)

# Set up the criterion and optimizer

criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training

n_epochs = 1000
for epoch in range(n_epochs):
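    model.train()  # put the model in training mode
    optimizer.zero_grad()  # reset gradients; they accumulate by default, so clear the previous round's values before each epoch or the parameter updates will be wrong
    y_pred = model(X_train_tensor)  # forward pass
    loss = criterion(y_pred, y_train_tensor)  # loss between the predictions and the true labels, using the criterion defined above
    loss.backward()  # backward pass; the gradients are stored in each parameter's .grad attribute
    optimizer.step()  # update the model parameters from the gradients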
    
    if (epoch+1) % 100 == 0:
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {loss.item():.4f}")

# Predict

X_test_scaler = scaler.transform(X_test)
X_test_tensor = torch.tensor(X_test_scaler, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.reshape(-1, 1), dtype=torch.float32)

model.eval()
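with torch.no_grad():  # disable autograd so no computation graph is tracked, which saves memory and speeds up inference; training needs gradients, evaluation does not, and this line tells PyTorch explicitly that we are evaluating
    # Predicted probabilities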
    y_train_prob = model(X_train_tensor)
    y_test_prob = model(X_test_tensor)
    
    threshold = 0.5

    y_train_pred = (y_train_prob > threshold).int().numpy()
    y_test_pred = (y_test_prob > threshold).int().numpy()

# Evaluate
print(f"\n[Train Accuracy] {accuracy_score(y_train, y_train_pred):.2f}")
print(classification_report(y_train, y_train_pred))

print(f"\n[Test Accuracy] {accuracy_score(y_test, y_test_pred):.2f}")
print(classification_report(y_test, y_test_pred))

Output

[Train Accuracy] 0.87
              precision    recall  f1-score   support

           0       0.86      0.88      0.87     40060
           1       0.88      0.85      0.86     39940

    accuracy                           0.87     80000
   macro avg       0.87      0.87      0.87     80000
weighted avg       0.87      0.87      0.87     80000


[Test Accuracy] 0.87
              precision    recall  f1-score   support

           0       0.85      0.88      0.87      9954
           1       0.88      0.85      0.86     10046

    accuracy                           0.87     20000
   macro avg       0.87      0.87      0.87     20000
weighted avg       0.87      0.87      0.87     20000

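Evaluating the Results

We analyze the PyTorch run directly. Because this example uses synthetic data, the results look quite clean. Broadly, we can observe the following:

The model shows no obvious overfitting between the training and test sets
Performance on the positive and negative classes is highly symmetric (precision and recall differ only slightly)
There is no sign of the model favoring one class (a problem common with imbalanced data)
Macro avg ≈ Weighted avg → consistent with a nearly balanced class distribution, as the support column shows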

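Suggested Next Steps

Logistic Regression is a linear model, so it cannot fit data whose decision boundary is non-linear. The good results here owe a lot to make_classification; on a real dataset the model needs further evaluation
The loss is printed every 100 epochs, but it was not checked for oscillation, premature convergence, or similar issues
The threshold is fixed at 0.5 without evaluating other decision thresholds; this should be examined further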
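As one possible way to follow up on the threshold point, the sketch below sweeps a range of candidate thresholds on a held-out split and keeps the one with the best F1-score. It uses sklearn's LogisticRegression for brevity rather than the PyTorch model above; the threshold grid, the choice of F1 as the selection criterion, and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression().fit(X_train, y_train)
val_prob = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

# Candidate thresholds 0.10, 0.15, ..., 0.90 (an arbitrary grid)
thresholds = np.arange(0.1, 0.91, 0.05)
f1_scores = [f1_score(y_val, (val_prob > t).astype(int)) for t in thresholds]

best_idx = int(np.argmax(f1_scores))
best_threshold = thresholds[best_idx]
print(f"Best threshold by F1: {best_threshold:.2f} (F1={f1_scores[best_idx]:.3f})")
```

On balanced synthetic data the best threshold may well end up close to 0.5; the sweep tends to matter more when classes are imbalanced or when false positives and false negatives carry different costs. The threshold should be chosen on validation data, not on the final test set.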

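Conclusion

As an entry-level model for classification tasks, Logistic Regression looks simple, yet it carries several layers of ideas: probabilistic thinking, optimization, and model evaluation. It still has a place in many practical applications, especially when interpretability and efficiency matter, for example in credit scoring, medical prediction, or risk assessment, where Logistic Regression is often still the first choice.

Besides building the model quickly with sklearn, this article also uses PyTorch to reconstruct Logistic Regression from the ground up. The goal is not higher accuracy but a clearer view of why the model works, how it operates, and which aspects you can control. That understanding lays a solid foundation for the more complex classification models that follow.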